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Abstract 


We present one of the first algorithms on model based reinforcement learning and 
trajectory optimization with free final time horizon. Grounded on the optimal con¬ 
trol theory and Dynamic Programming, we derive a set of backward differential 
equations that propagate the value function and provide the optimal control policy 
and the optimal time horizon. The resulting policy generalizes previous results in 
model based trajectory optimization. Our analysis shows that the proposed algo¬ 
rithm recovers the theoretical optimal solution on linear low dimensional problem. 

Einally we provide application results on nonlinear systems. 

1 Introduction 

Trajectory optimization is one of the most active areas of research in machine learning and control 


theory with a plethora of applications in robotics, autonomous systems and computational neu¬ 
roscience. Among the different methodologies. Differential Dynamic Programming (DDP) is a 
model based reinforcement learning algorithm that relies on linear approximation of dynamics and 
quadratic approximations of cost functions along nominal trajectories. Even though there has been 
almost 45 years since the fundamental work by Jacobson and Mayne on Differential Dynamic Pro¬ 
gramming m, it is a fact that research on trajectory optimization and model based reinforcement 
learning is performed nowadays by having a main ingredient DDP. In the NIPS community, recently 
published state of art methods on trajectory optimization use DDP to perform guided policy search 
m and data-efficient probabilistic trajectory optimization El. Earlier work on DDP includes min- 
max a, control limited El, receding horizon 013, and stochastic optimal control formulations 


013. 


Despite all of this research on trajectory optimization using model based reinforcement learning 
methods such as DDP, there has not been any effort towards the development of model based trajec¬ 
tory optimization algorithms in which the time horizon is not a-priori specified. The time horizon is 
one of the important free tuning parameters in trajectory optimization algorithms and in most case 
is manually tuned based on the experience of the engineer. 

In this paper we present a new algorithm on model based reinforcement learning in which optimiza¬ 
tion is performed with respect to control and the time horizon. While free time horizon DDP has 
been initially derived by Jacobson and Mayne in m, the resulting algorithm is not implementable 
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and it relies on the assumption that the initialization of the algorithm starts close to the optimal con¬ 
trol solution. This will become more clear in the next section as we present our analysis on free final 
time model based trajectory optimization. 


2 Problem Formulation and Analysis 

We consider model based reinforcement learning problems in which optimization occurs with re¬ 
spect to control and time horizon. In mathematical terms these problems are formulated as follows; 


= min J(x(-), u(-)) = min 

u(-) u(-) 




rtf 




'to 


, ( 1 ) 


where the term ) is defined as <i)(x(fy), z/, f/) = f), tf) -f i/ 'il;(x(tf ),tf ), 

is the terminal cost, is the terminal constraint and i/ is the correspond¬ 

ing Lagrange mulitplier. L(x(f), u{t),t) is the running cost accumulated along the time horizon tf, 
which is not specified a-priori. The cost function J(x(-), u(-)) in ([T]i is minimized subject to the 
dynamics: 

u(f), f), Xo = x(fo). (2) 


where x € M" is the state and the u G R.™ is the control of the dynamics. Note that the value 
function V (x(fo), is now a function of the Lagrange multiplier v and the terminal time tf. 

This is important for the derivation of the free time horizon algorithm since expansions of the value 
function are computed not only with respect to nominal controls and state trajectories but also with 
respect to nominal D and tf. 


2.1 Derivation of Differential Dynamic Programming (DDP) with Free Final Time 

Our analysis and derivation of the free time horizon model based reinforcement learning is in con¬ 
tinuous time. As it is shown, a set of backward ordinary differential equations is derived that back 
propagates the value function along the nominal trajectory. In particular, given a nominal trajectory 
(x(-), u(-)) with nominal Lagrange multiplier D and terminal time if, we start our analysis with the 
linearization of the dynamics as follows; 

+ i5x(t), u{t) + 6u{t), t) 

= Fx(x(f),u(f),<)(5x(<)-f Fu(x(f),u(f),f)5u(f). (3) 

All the quantities in the derivation later are evaluated at (x(f), u(f), v, if) unless otherwise specified. 
Since our derivation is in continuous time, we consider the corresponding Hamilton-Jacobi-Bellman 
equation: 


dV{x{t),t;iy,tf) 

dt 


= min 

u{t) 


n{x{t),t;i^,tf) , 


(4) 


under the terminal condition y(x(f/), f/; f/) = ^{x{tf),i',tf), and with the Hamiltonian func¬ 

tion 'Hifx.it), t\ v,tf ) defined as follows: 

n{x{t), t] V, tf) = L{xit),u{t),t) H 14(x(f), f; v, tff F{x{t),u{t),t). (5) 

We take expansions of the terms on both sides of |4] around {x,u.,D,tf). Notice that this is in 
contrast with the derivation of free final time DDP in IT] in which the expansion takes place around 
(x*, u*). Hence, the key assumption in |T| is that u is close to the optimal control u*, which makes 
the algorithm hard to implement, especially when the optimal tf is not known a-priori. Moreover, 
expansion of the Hamiltonian 'H around u* yields ^|u=u* = 0, which results in dropping terms 
from the derivation. 
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The left-hand side of (|4|i can be expanded as 


o O / 

— + Sx{t), + Sv,if + Stf) Ri — (v(x(t), t; T, if) + V^Sx{t) + V^Sv + Vt^Stj 


T T 



"(5x(t)" 

Sx{t) Sv Stf 

Vux y^u y^t f 



- 

ytfx ytfi' ytftf 


- Stf . 


(6) 

Next we make use of the fact that 

= !<■> + ^(O’FWO.aW.*) ^ -!(■) = -^t) + |^(■)V(x(t),u(^),t). (7) 

Based on the equation above we have that 
d 

- -^V{x{t) + Sx{t),t-, V -f 8v, tf + Stf) 


d 

df 

--f- 

df V2 


V(x{t),t; v,tf) + Sx{t) + V^,Sv + VtjStf 
5x{t) 5iy 6tf 


^xt f 

Vvx Vw Vut f 


Vtfx Vtfv Vtftf 

F + Sx{t)' + 5vV,^F + 5tfVt,^F 

1 

+ - 


6x{t)' 

5v 

V . 


( 8 ) 


r 7 


VxxxF VxuxF VxtfxF 


'5yi{t)~ 

Sx{t) Sv Stf 


VvxxF VuvxF VutfxF 


5u 

- 


VtjxxF VtjvxF VtjtfxF 


- Stf . 


The next step is to work with the expansion of the right-hand side of the HJB equation in (|4|. In 
particular, we have that 


V^{x(t) + 6x{t),t; V -f 8v, tf + Stf) Ri V^{x{t),t; v, tf) + V^^Sx{t) + V^^Sv + V^tfStf 


Sx{t) Sv Stf 


y^xu yxxtf 


■(5x(i)" 

yxux yxi/i' yxiftf 


Sv 

1 

‘T-. 

‘+-, 


- Stf . 


(9) 


In addition, the running cost and the dynamics are expanded as follows: 


L{x{t),u{t),t) = L{x(t) + Sx{t), u(f) + Su(t), t) Ri L{x(t), u{t),t) -f L^Sx{t) + L^Su{t) 


Sx{t) Su(t) 


Sx(t) 

Su(t) 


( 10 ) 


F{x{t), u{t), t) = F{x{t) + Sx{t),u{t) + Su{t),t) Ri F{x{t), u{t),t) + F:^Sx{t) + FuSu{t). 

( 11 ) 


Therefore, the right hand side of (|4|i can be expressed as 

1 


min L(x(f), u{t),t) + L^Sx{t) -f L^Su{t) + 

6u(t) 


Sx{t) 5u(f) 


L, 


Sx{t) 

Suit) 


+ V^F + V^ F^Sx{t) -f 14 FuSu{t) + Sx{t) V^^F + Sx{t) V^^F^Sx{t) + Sx{t) V^^FuSu{t) 

+ Sv V^^F + Sv Vy^F^Sx{t) + Sv V^^F^Su{t) + StfVt^^F + StfVtj^F^Sx{t) F StfVtj^F^Su{t) 


Sx(t) Sv Stf 


VxXxF VxUxF VxtjxF 

Vux.xF V^vxF VvtfxF 

VtfxxF VtfuxF VtftfxF 


'Sx{t)' 

Sv 

. Stf . 


-f H.O.T. 


( 12 ) 
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Equating (l8]l with (fTSl i and cancel like terms, we get 


dt\ 


V + <5x(t) 5v Vt^Stf + — 


i5x(t) Sv 6tf 


^^x 


■(5x(i)" 

K/x y^u Vutf 


6u 

ytfx ytfi/ ytftf 


- stf. 


min \ L + LJx{t) + LJu{t) + - 


-^xx -^xu 

i5x(f) 

Lux -^uu 

(5u(f) 


i5x(i) 5u(i) 

+ (5x(i)' 14 xE'x<5x(<) + Sx{t)' V^:^F^Su{t) + Sv"V^:^F^Sx{t) + 5v' V^:^F^Su{t) 

+ StfVtj^F^Sx{t) + 6tfVtj^F^Su{t) 


+ Fx< 5 x(i) + 14 FuSu{t) 


(13) 


To find the 6u{t) that minimize the equation, we take derivative of the right hand side of ( fOl) and 
set it to 0, 

0 = Tu + -^uu^u(f) + (^^ux + 2^^^^ F^Vxx)Sx(t) + F^Vx + F^Vxvdv + F^VxtfStf. 

(14) 

The update law for the control is thus given by 

6u{t) = 1(f) + Kx(f)()x(f) + K^{t)6t^ + Ktf{t)Stf. (15) 

where the terms 1(f), Kx(f), Ki,(f) and (f) are defined as follows 

1(f) = -L-i(Lu + F>x), Kx(f) = -T-^(iiux + + FM, 

K,(f) =-L-iF>x., K*,(f) = -L-iF>xt,. (16) 

Note that L^u is guaranteed to be invertible if the running cost L = g{x.) + u Ru, where i? > 0. 
This type of cost is normal for a mechanical system where we would like to minimize the energy 
cost of the control. 


Substitution of the optimal policy variation 5u back to the HJB equation results in a set of backward 
ordinary differential equations that propagate the expansion of the value function which consists of 
the terms V, 14,14:x) 14i/) Vxi^, V^tf and 14tj- These backward differential equations are given as 
follows 


l/ = L-irLuul, 

df 2 ’ 

-^14 = ix - K^^LuuI + F>x, 

-^k = kX 

= ^xx ~ Kjj.LuuKx + 2i^5j.T4;x5 
d T 

“ —^tfFuuKtf, 

= -l^xuKi^ + F^Vxv + Yx.xFuKu 
~^14;t/ = LxuK.tf + F^V-x.tf + 14x.FuK 




-Au. 


^T T 


(17) 


where all the quantities are evaluated at (x(f), u(f), E, if). To numerically solve the equations in 
(ITtI i one has to compute the terminal conditions. In the next section we present the derivation for 
the terminal condition and provide an overview of the algorithm. 
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2.2 Terminal Conditions 


The terminal conditions can be determined by the following procedure. 


From 

C(x(fo),fo;J^,i/) = min J(x(-),u(-)) = min<j / L{x{t),u{t))dt + \, (18) 

u(-) u(-) 


we have that for any t € [to,tf], 

C(x(f),f;i/,</) = min J(x(-),u(-)) = mini [ L(x(s), u(s))ds + $(a;(f/),f/)!. (19) 

u( ) u( ) [ Jf J 


'to 

rtf 


Therefore, 


V(x(tf) + Sx(tf), if;v + 5v, if + 6tf) 

ftf+Stf 


= min 

u(.) 


ftf+dif _ _ 'I 

J u(f), t)dt + $(x(f/ + Stf), D + 6if,tf + 5tf )j 


( 20 ) 


L{x{if), u{if),t)6tf + $(x(f/) + Sx{if) + 5i{if)Stf, V + 8v, if + 6tf) 

~ T(x(f/), u{if),t)6tf + <^>{xiif), V, if) + $^((5x(f/) + F{x{if), u{if), if)6tf) 
+ u(f/), iff 6v + (x(f/), uiif), if)Stf 


'(5x + FStf~ 

T 

’$xx 

^Xtf 

'(5x + F8tf 

6v 




8v 

- % - 



^tft,_ 

- % - 


V{x{tf) + 5x{tf),tf]v + 5v,tf + 6tf) = $ + + (L + + ^ts)5tf 


p n 


^xx 


^xi/ + ^xx^ 

(5x(t) 8v 8tf 


$1.X 


^irtf + ^uxF 

- 


+ F^ 


+2<S>t^^F + F^<^>^^F_ 


'6x{ty 
6v , 
. ^tf . 
( 22 ) 


where x{if +6tf) is evaluated by x{if) + 6x{if) +x{if)6tf and x(f/) = F{x{if), u{if),if). The 
arguments of the functions in the last line of equations are the same as those in the previous equations 
and are thus omitted. The minimization with respect to u is dropped on the third line of equations 
because T(x(f), u(f))df is evaluated by L(x(f/), and the latter is only a function 

of the nominal control. Note that this approximation is relatively rough, since we approximate 
jtf+stf u(f))df by L{x{if), u{if))Stf instead of L{x{if), u{if))Stf and follow up with an 

expansion on x{if) and u{if). But the simulation results suggest that such level of approximation 
is good enough. 


Hence, at t = tf, the terminal conditions are 
^(x(f/)J/; = <^{x{if),D,if), 

V^{x{ij),ij]v,ij) = $x(x(f/),F,f/), 

V^{x{if),if]v,if) = ^^{x{if),v,if), 

Vtf {x{if), if] V, if) = L{x{if), u{if)) + $x(x(</), P, if)^ F{x{if),u{if)) + (x(f/), F, if), 

^x(x(f/), if] V, if) = $xx(x(f/), P, if), 

Vtftf (x(</), if] P, if) = <l>tftf{x{if),P, if) + 2$t^x(x(f/), P, if)Fix{if),u{if)) 

+ u(f/))^$xx(x(f/), P, if)F{x{if), u{if)), 

V^,,(x(if),if]P,if) = $xi.(x(f/),F,f/), 

V^tf (x(f/), if] P, if) = $xt^ ix{if), P, if) + $xx(x(f/), F, f/)F(x(f/), u(f/)), 
y-rtf (x(f/), if] p, if) = {^{if), P, if) + $^x(x(f/), P, if)F{x{if), u{if)). 

(23) 
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Su 

= -c 

Ki.(x(fo),fo; v,tf) 

3 

0 

■40 

0 

-to 

-1 


5tf_ 




Vtf{yi{tf),tf]v,tf)_ 


Given the boundary conditions of the value function and its derivatives, we can back-propagate the 
differential equations we derived earlier to find their values. Our next step is to update the control 
through (ffSl) . and in order to do so, we need to find the update law of Siy and 6tf. 

We follow the derivation in jTl and set 

(24) 

where C G [0,1] is introduced to ensure that the update of 5v and Stf are not too large. 

3 Simulation Results 

3.1 Double Integrator 

We first apply the algorithm on a simple system, namely, the double integrator. We compare our 
numerical result with the analytical solution to verify the algorithm. The dynamics is given by 
it = X 2 -, and X 2 = u. Initial condition is x(0) = [a;i(0); 2 : 2 ( 0 )] = [0; 0]. The cost J = /g^(l + 
Ru^)dt. Terminal constraint is a;i — 1 = 0. Introducing the Lagrange multiplier ly, the cost can 
be reformulated as J = i'{xi — 1) + (1 + ^Ru^)dt. Given different values of R, we can hnd 

different optimal cost and terminal time. In particular, let i? = 0.1,1,10, the corresponding terminal 
times are 0.819,1.456, 2.590, respectively. Optimal control and tf per iteration when i? = 0.1,1,10 
are shown in Figure [T] 

Now we solve this problem analytically to verify the simulation results. Denote the co-states by 
A = [Ai; A 2 ], the Hamiltonian is given by 


1 9 

iT = 1 -f 2^^^ Aia;2 + A 2 U. 


The co-states satisfy the adjoint equations 


Ai = -|^ = 0, 

UXi 

A - - A 

A2 — —- — —Ai. 

0X2 


(25) 


(26) 


(27) 


Utilizing the Pontryagin’s minimum principle, the optimal control u* can be calculated from 0 = 
^ = Ru + A 2 . Hence, u* = —X 2 /R. Transversality conditions are such that Xi(tf) = v, 
A 2 (f/) = 0, H{tf) = 0. Given the previous information, we are ready to solve the problem. From 
Ai = 0 and Ai(f/) = v, we get 

Ai(f) =u,t£ [0,f/]. 

Then from A 2 = —Ai and X 2 {tf) = 0, we have X 2 {t) = v{tf — t). Therefore, 


u* = -X 2 /R= 


(28) 


Note that the optimal control is a linear function of t. Furthermore, boundary conditions yields 
t* = (|i?)3 and ly* = - jt* = When R = 0.1,1,10, = 0.8190,1.4565,2.5900, 

respectively, which is consistent with the numerical simulation results presented in the control plots 
in Figure [T] 

3.2 Cart Pole 


In this subsection, we apply our algorithm on the inverted pendulum on a cart, as known as the cart 
pole problem, with M = 10 the mass of the cart, m = 1 and I = 0.5 are the mass and length of 
the pendulum, g = 9.8 the gravitational acceleration and u the force applied to the cart. The state 
X = [a:, X, 9,9]. The goal is to bring the state from x(0) = [0,0, tt, 0] to p = [0,0, 0,0], which 
represents the case where the pendulum is pointing strait up. The cost function is given by 

•7=2^ [ct-f (x-p)^Q(x-p)-f u^i?uu]-f A^([x3(f/);x4(f/)] - [p3;p4]). 
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t| per iteration 



(a)R=0.1 

t, per iteration 


(d) R=1 


(b) R=0.1 


Controi 



(c) R=1 


t, per iteration 



Iterations 


(f) R=10 


Figure 1: Figure [Tal fT^ [Tel show the numerical optimal control for the cases when R — 0.1, 1 ,10, 
respectively. Figure dbllMlIIS present the tf per iteration for the cases when R = 0.1,1,10, 
respectively. Dashed red lines represents the according analytical optimal tf. 


where Q = diagjO, 0,1,1} and R^ = 0.01. Initial values are given as f/ = 1, A = [0, 0] . The 
multipliers 7 = 0.05 and e = 0.05. We run the algorithm for 300 iterations and the convergence is 
achieved at around 200th iteration. Figure 1^ presents the optimal control u*. The corresponding 
optimal trajectories of the states are depicted in Figure |2] Cost and t / per iteration are shown in 
Figure[^and|^ respectively. 


X X 



Time in sec Time in sec 


Figure 2: Optimal trajectories of the states in blue. Red lines represent the desired terminal states. 


Control 



Cost per iteration 


(a) Optimal control of the cart 
pole system. 



(b) Cost per iteration. 


tf per iteration 



(c) tf per iteration. 


Figure 3: Cost and tf per iteration for the cart pole system. 
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3.3 Quadrotor 


The dynamic model of the quadrotor includes 16 states: 3 for the position (r = {x, y, z)^), 3 for 
the Euler angles ($ = {(j), 9, ip) ), 3 for the velocity (f = (x, y, z) ), 3 for the body angular rates 

($ = (p, q, r) ) and 4 for the motor speeds {fl = (wi, a; 2 , ujs, W 4 ) ). The corresponding dynamics 
of the quadrotor is given as follows: 

dx 

— = /(x) + Gu, (29) 

where x = [r, $, f, T*, 11 ] G and u = (ui, M 2 , ws, W 4 ) G is the control vector, where 
Ml represents the thrust force, and M 2 ,M 3 ,M 4 represent the pitching, rolling, yawing moments, 
respectively. The corresponding cost function is defined as J = ^ /o^ [^t + (x — p) Q(x — 
p) + U^i?uu] + 5 (x(l/) -pyQf{x{tf) - p) + A^([xi(f/);...;x 6 (f/)] - [pi]... ] pe]), where 
p = [pi;... iPie] S denotes the desired terminal states. In the simulation, we set 


p{i) 


ri,i = 3; 

\ 0 , otherwise, 


and Qf{i,i) 


n0^z = 1,2,3; 

I 10®,1 = 4,...,9; 

I 10®,! = 10,11,12; 

to, otherwise. 


(30) 


and all the off-diagonal terms are assigned to 0. Q = O.OIQ/. Ru = 0.00017. 7 = 0.02 and e = 
0.02. The desired terminal state p is chosen for the quadrotor to execute the take-off maneuver. 50 
iterations are included to ensure the convergence and the cost per iteration is presented in Figure Bbl 
The corresponding optimal state trajectories are shown in Figure |4] Optimal control u is illustrated 
in Figure |5a] t / per iteration is presented in Figure |5c] 


Xl 


x5 


x9 


x13 


x2 


x3 


x6 


x7 




x4 


- -9 - 


x8 



x12 


x16 


Time in sec Time in sec Time in sec Time in sec 


Figure 4: Optimal trajectories of the states of the quadrotor in blue. Dashed red lines represent the 
desired terminal states. 
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